With the significant progress made in speech technologies over the past two decades, spoken dialog systems have been able to handle increasingly complex tasks, while at the same time becoming more sophisticated in handling the structure and the uncertainties inherent to conversation. Yet, at the lower level of timing and turn-taking, current systems are still rigid and brittle, particularly when natural language input is accepted. In this talk, I will focus specifically on the problem of end-of-turn detection, which has typically been handled by a pause detection mechanism combined with a fixed threshold on the duration of the pause (e.g. 'consider that the user has finished their utterance when they pause for 700 ms or more'). The limitations of this approach are obvious. If the threshold is short, the system will be prone to interrupting the user in the middle of their turn ('cut-ins'), whereas if it is long, system latency will suffer. To address these issues, we designed an algorithm that dynamically sets the threshold for each pause using features available from different levels of dialog, from speech recognition scores, to prosody, to semantic interpretations, to discourse structure. By combining these features in a single decision tree, we were able to reduce system latency by up to 24% at a fixed cut-in rate in a publicly deployed spoken dialog system.
We then moved one step further and framed turn-taking as a dynamic decision process, i.e. one in which time is an important factor in the utility/cost of each possible action. By grounding the problem in a well-established theoretical framework, we have been able to 1) improve over our previous results, with latency reductions of up to 35% over a fixed-threshold baseline, 2) integrate the turn-taking mechanism in a general spoken dialog architecture that captures the relationship between low levels of interaction and higher discourse structure, and 3) generalize the approach to other aspects of turn-taking, such as interruption detection.
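As a rough illustration of the contrast drawn above between a fixed pause threshold and a threshold set per pause, the following is a minimal Python sketch. Only the 700 ms figure comes from the abstract. The feature names, the shape of the tree, and every other threshold value are assumptions made up for the example; the tree used in the deployed system was learned from data and is not reproduced here.

```python
from dataclasses import dataclass

# The example value quoted in the abstract for the fixed-threshold baseline.
FIXED_THRESHOLD_MS = 700


@dataclass
class PauseFeatures:
    """Hypothetical features observed when a pause begins (names are assumptions)."""
    asr_confidence: float      # speech recognition score in [0, 1]
    final_pitch_falling: bool  # prosodic cue
    parse_complete: bool       # semantic cue: the utterance so far parses fully
    expects_yes_no: bool       # discourse cue: system just asked a yes/no question


def fixed_endpoint(pause_ms: int) -> bool:
    """Baseline: declare end of turn once the pause reaches the fixed threshold."""
    return pause_ms >= FIXED_THRESHOLD_MS


def dynamic_threshold_ms(f: PauseFeatures) -> int:
    """Toy hand-written decision tree; all leaf values are illustrative only."""
    if f.parse_complete:
        if f.expects_yes_no or f.final_pitch_falling:
            return 300   # strong evidence the turn is over: respond quickly
        return 600
    if f.asr_confidence < 0.5:
        return 1200      # incomplete and uncertain input: wait longer
    return 900


def dynamic_endpoint(pause_ms: int, f: PauseFeatures) -> bool:
    """Declare end of turn once the pause reaches the per-pause threshold."""
    return pause_ms >= dynamic_threshold_ms(f)


# A pause after a complete-sounding utterance, and one in mid-utterance.
finished = PauseFeatures(0.9, True, True, False)
mid_turn = PauseFeatures(0.4, False, False, False)

print(fixed_endpoint(400), dynamic_endpoint(400, finished))  # False True
print(fixed_endpoint(800), dynamic_endpoint(800, mid_turn))  # True False
```

In the first case the dynamic rule responds after 400 ms where the fixed rule would still be waiting (lower latency); in the second it keeps waiting at 800 ms where the fixed rule would already have cut in.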